SIMD instructions in Java

Published: 28.12.2023

Since the release of Java 16 it has been possible to use SIMD (Single Instruction, Multiple Data) instructions explicitly to harness the full power of the CPU. This is made possible by the Vector API.
Because the API is still an incubator (non-final) feature, its module has to be added explicitly when compiling and running: --add-modules jdk.incubator.vector
The --enable-preview flag is meant for preview features and should not be needed for an incubator module.

Terminology - shapes and species

The shape of a vector is its size in bits (e.g. 256). A species is a combination of an element type (int, float, ...) and a shape.
The optimal shape (size in bits), and thus the optimal species, varies from CPU to CPU. This is why each element-specific subclass of Vector (IntVector, FloatVector, ...) has a static SPECIES_PREFERRED field that holds the optimal species. The expression IntVector.SPECIES_PREFERRED, for example, evaluates to the optimal species for processing integers on the currently running Java platform.
Finally, the number of lanes in a vector is the number of its elements. For instance, a DoubleVector with a shape of 256 bits has 4 lanes (because 64 * 4 = 256).
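
The lane arithmetic above can be checked with a few lines of plain Java. This is only a sketch of the calculation and does not use the Vector API; the class LaneMath and the method lanes are hypothetical names chosen for illustration.

```java
// Plain-Java sketch (no Vector API): lanes = shape in bits / element size in bits.
// LaneMath and lanes(...) are illustrative names, not part of the Vector API.
final class LaneMath {

    /** Number of lanes in a vector of shapeBits bits holding elements of elementBits bits. */
    static int lanes(int shapeBits, int elementBits) {
        if (shapeBits <= 0 || elementBits <= 0 || shapeBits % elementBits != 0) {
            throw new IllegalArgumentException(
                    "shape must be a positive multiple of the element size");
        }
        return shapeBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(lanes(256, Double.SIZE));  // 4  (64-bit doubles)
        System.out.println(lanes(256, Float.SIZE));   // 8  (32-bit floats)
        System.out.println(lanes(512, Integer.SIZE)); // 16 (32-bit ints)
    }
}
```

With the real API, the same number should be what a species reports through its length() method.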

Example: Adding a number to each element of an array

One simple use case for SIMD instructions is adding one number to each element of an array:

    import java.util.Arrays;
    import jdk.incubator.vector.*;
    ...
    // get the preferred species for float vectors
    final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    float[] array = new float[29];
    Arrays.fill(array, 5.0f);
    float[] result = new float[array.length];
    float addValue = 10.0f;
    /* instead of adding 10.0f, we could also add a vector that has the
       value 10.0f for each element:
       FloatVector addValue = FloatVector.broadcast(SPECIES, 10.0f); */

    int i = 0; // declare i here so we can use it after the first for loop
    for (; i < SPECIES.loopBound(array.length); i += SPECIES.length()) {
        // get a float vector from the array at index i
        FloatVector vec = FloatVector.fromArray(SPECIES, array, i);
        // add 10.0 to each element of the vector
        FloatVector sum = vec.add(addValue);
        // put the result in the result array at index i
        sum.intoArray(result, i);
    }
    // process the rest of the array with plain scalar code
    for (; i < array.length; i++) {
        result[i] = array[i] + addValue;
    }
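
The reason for the two loops is that the array length (29) is usually not a multiple of the lane count. The following plain-Java sketch models how the work is split, assuming that loopBound returns the largest multiple of the lane count that is less than or equal to the given length. LoopSplit and vectorBound are hypothetical names, and no Vector API is used.

```java
// Plain-Java sketch of the loop split: full vectors first, scalar tail afterwards.
// LoopSplit and vectorBound(...) are illustrative names, not part of the Vector API.
final class LoopSplit {

    /** Largest multiple of lanes that is less than or equal to length. */
    static int vectorBound(int length, int lanes) {
        return length - (length % lanes);
    }

    /** Number of elements left over for the scalar loop. */
    static int tail(int length, int lanes) {
        return length - vectorBound(length, lanes);
    }

    public static void main(String[] args) {
        int length = 29; // same array length as in the example above
        for (int lanes : new int[] {4, 8, 16}) {
            System.out.println(lanes + " lanes: vector part [0, "
                    + vectorBound(length, lanes) + "), scalar tail of "
                    + tail(length, lanes) + " element(s)");
        }
    }
}
```

For 8 float lanes (a 256-bit shape), three full vectors cover indices 0 to 23 and the remaining 5 elements are handled by the scalar loop.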

Using Masks

Most operations also accept a mask. A VectorMask<E> has a boolean value for each lane. An operation will only be executed on the lanes on which the mask is true. For example, we could multiply each value of an array by 2 but only if the number is less than 50:

    import java.util.SplittableRandom;
    import jdk.incubator.vector.*;
    ...
    // get the preferred species for int vectors
    final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // fill array with random numbers
    SplittableRandom random = new SplittableRandom();
    int[] array = new int[42];
    for (int i = 0; i < array.length; i++) {
        array[i] = random.nextInt(100);
    }
    int[] result = new int[array.length];

    int i = 0;
    for (; i < SPECIES.loopBound(array.length); i += SPECIES.length()) {
        // get an int vector from the array at index i
        IntVector vec = IntVector.fromArray(SPECIES, array, i);
        // get the mask that is true for each lane where the value of vec is
        // less than 50
        VectorMask<Integer> mask = vec.lt(50);
        // multiply by 2 (only on the lanes where the mask is true; the other
        // lanes keep their value) and write the result to the array
        vec.mul(2, mask).intoArray(result, i);
    }
    // process the rest of the array with plain scalar code
    for (; i < array.length; i++) {
        int num = array[i];
        result[i] = (num < 50) ? (num * 2) : num;
    }
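
To make the mask semantics concrete, here is a scalar model in plain Java of what the masked multiplication computes lane by lane. It is a sketch for illustration only: MaskModel, lessThan and maskedMul are hypothetical names, and the real API performs these steps on whole vectors instead of looping.

```java
import java.util.Arrays;

// Scalar model of a masked lanewise operation (no Vector API involved).
// MaskModel, lessThan(...) and maskedMul(...) are illustrative names only.
final class MaskModel {

    /** Builds a mask that is true for each lane whose value is less than limit. */
    static boolean[] lessThan(int[] lanes, int limit) {
        boolean[] mask = new boolean[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            mask[i] = lanes[i] < limit;
        }
        return mask;
    }

    /** Multiplies lanes where the mask is true; other lanes keep their original value. */
    static int[] maskedMul(int[] lanes, boolean[] mask, int factor) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = mask[i] ? lanes[i] * factor : lanes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] lanes = {10, 50, 49, 99};
        boolean[] mask = lessThan(lanes, 50);
        System.out.println(Arrays.toString(mask));                      // [true, false, true, false]
        System.out.println(Arrays.toString(maskedMul(lanes, mask, 2))); // [20, 50, 98, 99]
    }
}
```

The model matches the scalar tail loop of the example: values below 50 are doubled, all others pass through unchanged.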

Be aware that the JVM may already auto-vectorize simple loops, in which case implementing your own vectorization might not bring any performance increase. Additionally, the performance gain can differ greatly from CPU to CPU. For example, a CPU that supports the AVX-512 instruction set can handle up to 512 bits per instruction, which is twice as much as with AVX2 (256 bits).


Sources

[1] Oracle - JDK 21 Docs
[2] Baeldung